Seoul Bike Rent

Authors: Bordas Alexandre and Carval Nicolas

Here is our work for this dataset named "Seoul Bike Sharing Demand".

This dataset contains the count of public bikes rented each hour in the Seoul Bike Sharing System, along with the corresponding weather data and holiday information.

Rental bikes are currently being introduced in many urban cities to enhance mobility comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Providing the city with a stable supply of rental bikes eventually becomes a major concern, and the crucial part is predicting the bike count required at each hour to keep that supply stable.

Thus we will explore this dataset in order to show the link between the number of bikes rented and the other variables present in the dataset. We will then build a machine learning model to predict, approximately, the number of bikes that could be rented under specific conditions.

You can find the dataset here: https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand

You'll find:

  • First approach of the data
  • Preprocessing of the data
  • Visualisation
  • Machine Learning

Acknowledgement:

Libraries

In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go

# Interactive visualisation libraries
from ipywidgets import widgets
from ipywidgets import interact

Dataset

In [2]:
df = pd.read_csv("SeoulBikeData.csv", encoding='latin1')
df.head()
Out[2]:
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday Yes
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday Yes
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       8760 non-null   object 
 1   Rented Bike Count          8760 non-null   int64  
 2   Hour                       8760 non-null   int64  
 3   Temperature(°C)            8760 non-null   float64
 4   Humidity(%)                8760 non-null   int64  
 5   Wind speed (m/s)           8760 non-null   float64
 6   Visibility (10m)           8760 non-null   int64  
 7   Dew point temperature(°C)  8760 non-null   float64
 8   Solar Radiation (MJ/m2)    8760 non-null   float64
 9   Rainfall(mm)               8760 non-null   float64
 10  Snowfall (cm)              8760 non-null   float64
 11  Seasons                    8760 non-null   object 
 12  Holiday                    8760 non-null   object 
 13  Functioning Day            8760 non-null   object 
dtypes: float64(6), int64(4), object(4)
memory usage: 958.2+ KB

Every column is in the correct format except the Date column, so we'll convert it later.

In [4]:
df.describe()
Out[4]:
Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm)
count 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000 8760.000000
mean 704.602055 11.500000 12.882922 58.226256 1.724909 1436.825799 4.073813 0.569111 0.148687 0.075068
std 644.997468 6.922582 11.944825 20.362413 1.036300 608.298712 13.060369 0.868746 1.128193 0.436746
min 0.000000 0.000000 -17.800000 0.000000 0.000000 27.000000 -30.600000 0.000000 0.000000 0.000000
25% 191.000000 5.750000 3.500000 42.000000 0.900000 940.000000 -4.700000 0.000000 0.000000 0.000000
50% 504.500000 11.500000 13.700000 57.000000 1.500000 1698.000000 5.100000 0.010000 0.000000 0.000000
75% 1065.250000 17.250000 22.500000 74.000000 2.300000 2000.000000 14.800000 0.930000 0.000000 0.000000
max 3556.000000 23.000000 39.400000 98.000000 7.400000 2000.000000 27.200000 3.520000 35.000000 8.800000
In [5]:
df.isna().sum()
Out[5]:
Date                         0
Rented Bike Count            0
Hour                         0
Temperature(°C)              0
Humidity(%)                  0
Wind speed (m/s)             0
Visibility (10m)             0
Dew point temperature(°C)    0
Solar Radiation (MJ/m2)      0
Rainfall(mm)                 0
Snowfall (cm)                0
Seasons                      0
Holiday                      0
Functioning Day              0
dtype: int64

No information is missing.

In [6]:
print(df.groupby("Date")[["Hour","Seasons"]].count().shape[0],"Days in the year, which is correct. No days missing")
365 Days in the year, which is correct. No days missing

Looking for outliers:

In [7]:
df[(df['Rented Bike Count']==0)][["Rented Bike Count","Functioning Day"]]
Out[7]:
Rented Bike Count Functioning Day
3144 0 No
3145 0 No
3146 0 No
3147 0 No
3148 0 No
... ... ...
8251 0 No
8252 0 No
8253 0 No
8254 0 No
8255 0 No

295 rows × 2 columns

We see that when Rented Bike Count equals 0 it corresponds to a non-functioning day.

Maybe by looking at every non-functioning day we can get more information:

In [8]:
df[df["Functioning Day"]=='No'].equals(df[(df['Rented Bike Count']==0)])
Out[8]:
True

In fact there are 13 non-functioning days in the year, and they account for all the rows without any bike rented. So we'll treat them accordingly.

In [9]:
print("Number of days within the year :",len(df.groupby("Date")))
df[df["Functioning Day"]=='No'].groupby("Date").count()
Number of days within the year : 365
Out[9]:
Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
Date
02/10/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
03/11/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
04/10/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
06/10/2018 7 7 7 7 7 7 7 7 7 7 7 7 7
06/11/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
09/10/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
09/11/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
10/05/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
11/04/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
18/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
19/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
28/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
30/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24

PreProcessing

We keep only functioning days: we already know the count will always be 0 on non-functioning days, so the column carries no further information and we drop it.

In [10]:
df = df[df["Functioning Day"]!='No'].copy()  # .copy() avoids SettingWithCopyWarning on later column edits
In [11]:
df.drop(columns=["Functioning Day"],inplace=True)
df.head()
Out[11]:
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday
In [12]:
print("Number of days left in the year :",len(df.groupby("Date")))
Number of days left in the year : 353

We have 353 days left, which corresponds to 365-12. There is no mistake: if you look back at the non-functioning days, you'll notice that on 06/10/2018 only 7 hours of the day were non-functioning, so that day survives the filter.
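This filtering behaviour can be double-checked on a tiny toy frame (hypothetical values, not the real CSV): a fully non-functioning day disappears, while a partially non-functioning day keeps its functioning hours.

```python
import pandas as pd

# Toy frame mimicking the notebook's structure: one fully non-functioning day
# and one day where only some hours are non-functioning.
toy = pd.DataFrame({
    "Date": ["02/10/2018"] * 3 + ["06/10/2018"] * 3,
    "Hour": [0, 1, 2, 0, 1, 2],
    "Functioning Day": ["No", "No", "No", "No", "Yes", "Yes"],
})

# Same filter as in the notebook
kept = toy[toy["Functioning Day"] != "No"]

# 02/10/2018 disappears entirely; 06/10/2018 keeps its 2 functioning hours.
print(sorted(kept["Date"].unique()))  # ['06/10/2018']
```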

Now we want to convert the date into the correct format :

In [13]:
df['Date'] =  pd.to_datetime(df['Date'],format="%d/%m/%Y")
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8465 entries, 0 to 8759
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Date                       8465 non-null   datetime64[ns]
 1   Rented Bike Count          8465 non-null   int64         
 2   Hour                       8465 non-null   int64         
 3   Temperature(°C)            8465 non-null   float64       
 4   Humidity(%)                8465 non-null   int64         
 5   Wind speed (m/s)           8465 non-null   float64       
 6   Visibility (10m)           8465 non-null   int64         
 7   Dew point temperature(°C)  8465 non-null   float64       
 8   Solar Radiation (MJ/m2)    8465 non-null   float64       
 9   Rainfall(mm)               8465 non-null   float64       
 10  Snowfall (cm)              8465 non-null   float64       
 11  Seasons                    8465 non-null   object        
 12  Holiday                    8465 non-null   object        
dtypes: datetime64[ns](1), float64(6), int64(4), object(2)
memory usage: 925.9+ KB

Adding a month column in order to analyse more closely later :

In [14]:
df["Month"] = df["Date"].apply(lambda x: x.strftime("%B"))
df.tail()
Out[14]:
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Month
8755 2018-11-30 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn No Holiday November
8756 2018-11-30 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn No Holiday November
8757 2018-11-30 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn No Holiday November
8758 2018-11-30 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn No Holiday November
8759 2018-11-30 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn No Holiday November

We already know what to expect from standard columns such as date, hour and seasons. However, some columns may be worth analysing separately, as their values may show little variation. So we look at these columns more closely:

In [15]:
fig, ax = plt.subplots(figsize=(20,20),nrows=2,ncols=2)

df["Snowfall (cm)"].plot.hist(ax=ax[0,0],grid=True, bins=20, rwidth=0.9, color='#b9d9fa')
df["Rainfall(mm)"].plot.hist(ax=ax[0,1],grid=True,bins=20, rwidth=0.9, color='blue')
df["Visibility (10m)"].plot.hist(ax=ax[1,0],grid=True,bins=20, rwidth=0.9, color='grey')
df["Solar Radiation (MJ/m2)"].plot.hist(ax=ax[1,1],grid=True,bins=20, rwidth=0.9,xlabel="Solar Radiation (MJ/m2)",
                                        color='red')

ax[0,0].set_title("Frequency of Snow fall in cm over the year")
ax[0,0].set_xlabel("Snowfall (cm)")

ax[0,1].set_title("Frequency of Rainfall in mm over the year")
ax[0,1].set_xlabel("Rainfall(mm)")

ax[1,0].set_title("Frequency of Visibility in 10m over the year")
ax[1,0].set_xlabel("Visibility (10m)")

ax[1,1].set_title("Frequency of Solar Radiation in MJ/m2 over the year")
ax[1,1].set_xlabel("Solar Radiation (MJ/m2)")

fig.suptitle('Histograms of different columns', fontsize=20,y=0.92);

Snow and Rain values don't vary much, so let's create two new columns to convert them into categorical variables:

In [16]:
day = df.groupby("Date").mean(numeric_only=True)  # numeric_only skips the object columns
print("Not snowy :",day[day["Snowfall (cm)"]==0].shape[0],"days")
print("Not rainy :",day[day["Rainfall(mm)"]==0].shape[0],"days")
Not snowy : 326 days
Not rainy : 253 days
In [17]:
df["Snow"] = (df["Snowfall (cm)"]!=0)
df["Rain"] = df["Rainfall(mm)"].apply(lambda x: "No rain" 
                                      if x==0 else ("light rain" if x<3 
                                                    else ("Medium rain" if x<8 
                                                          else "Strong rain")))

Now let's check the correlation between all the variables:

In [18]:
corr = df.corr(numeric_only=True)  # numeric_only (pandas >= 1.5) excludes the object columns
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(10, 10))
    ax = sns.heatmap(abs(corr), mask=mask, vmax=1, square=True,cmap="Reds")
    ax.set_title("Correlation matrix of the Data set")
In [19]:
corr_matrix = df.corr(numeric_only=True)[['Rented Bike Count']].sort_values(by = ['Rented Bike Count'], ascending = False).drop(['Rented Bike Count'])
corr_matrix.style.background_gradient(cmap = 'coolwarm').format(precision=2)
Out[19]:
  Rented Bike Count
Temperature(°C) 0.56
Hour 0.43
Dew point temperature(°C) 0.40
Solar Radiation (MJ/m2) 0.27
Visibility (10m) 0.21
Wind speed (m/s) 0.13
Rainfall(mm) -0.13
Snowfall (cm) -0.15
Snow -0.20
Humidity(%) -0.20

For our regression model we decide to drop dew point, as it is too correlated with temperature, and of the two, temperature is the one better correlated with the rented bike count. We also notice an improvement in the correlation between the snow information and the rented bike count now that it is a boolean (Snow vs Snowfall (cm)).
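The multicollinearity argument can be illustrated on synthetic data (a hedged sketch with made-up numbers, not the actual dataset): when one predictor is almost a linear function of another, their correlation is close to 1 and keeping both adds little information.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
temp = rng.normal(13, 12, 1000)             # synthetic "temperature"
dew = temp - 9 + rng.normal(0, 3, 1000)     # "dew point" tracks temperature closely
count = 30 * temp + rng.normal(0, 200, 1000)  # synthetic target driven by temperature

frame = pd.DataFrame({"temp": temp, "dew": dew, "count": count})
print(frame.corr().round(2))
# temp and dew correlate near 1, so we keep only the predictor that is
# better correlated with the target, as done above with temperature.
```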

Visualisation

Seasonal plots
In [20]:
fig = px.scatter(df.sort_values(by="Date"),x="Date",y="Rented Bike Count",color="Seasons",
                 title="Scatterplot of Rented Bike Count by every day (including every hour)")
fig.show()

We can clearly see that winter stands out from the other seasons: its values are bunched together between 0 and 500 bikes per hour, which is quite low compared to the other seasons.

We can also see that there is much more volatility as we get closer to summer: values are widely dispersed and can reach 3500 bikes per hour.

In [21]:
temp = df.groupby(["Holiday","Seasons"])["Rented Bike Count"].mean().reset_index()
fig = px.bar(temp,x="Holiday", y="Rented Bike Count", color="Seasons",
             title="Barplot of the mean of rented bike count whether it is holiday or not, separated by seasons")
fig.show()

We decided to plot the mean because the number of holiday days is very low compared to working days.

We see that holidays don't have much influence on the mean number of rented bikes, even if the figure is slightly lower on holidays for every season.

In [22]:
fig = px.pie(df.groupby("Seasons")["Rented Bike Count"].sum().reset_index(),values="Rented Bike Count",
             names='Seasons',title="Pie Chart representing the proportion of total rented bike by seasons")
fig.show()

As expected from the previous graphs, winter doesn't even represent 1/10 of the total bikes rented, while summer is the dominant season with more than 1/3 of the total.

Plots by hour
In [23]:
temp1 = (df[df["Holiday"]=="No Holiday"].groupby(["Hour","Holiday"])["Rented Bike Count"].sum()*100/sum(df[df["Holiday"]=="No Holiday"]["Rented Bike Count"])).reset_index()
temp2 = (df[df["Holiday"]!="No Holiday"].groupby(["Hour","Holiday"])["Rented Bike Count"].sum()*100/sum(df[df["Holiday"]!="No Holiday"]["Rented Bike Count"])).reset_index()
temp = pd.concat([temp1, temp2])  # DataFrame.append was removed in pandas 2.0

temp.rename(columns={'Rented Bike Count': '% of rented bike count'}, inplace=True)

fig = px.bar(temp, x="Hour", y="% of rented bike count", color = "Holiday", 
             title='Proportion of rented bike per hour, taking in count if we are in holiday or not',
             color_discrete_sequence=['indianred', 'lightsalmon'])
fig.update_layout(barmode='group')
fig.show()

The holiday bars add up to 100%, and likewise for the non-holiday bars.

This graph lets us understand the trend underlying holidays: since there are roughly 10x more non-holiday rows, the impact would be invisible without normalising the values per group, which is what reveals the trend.

Conclusion: both categories follow the same stable pattern, with a drop during the night and a peak in the afternoon. However, on working days the 8h and 18h bars are much more dominant. This is explained by the start and end of the working day: people are travelling to or from work, which of course isn't the case during holidays.
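The per-group normalisation used above can be sketched in isolation (toy numbers, not the real counts): dividing each group by its own total makes groups of very different sizes comparable.

```python
import pandas as pd

# Hypothetical mini-frame: hourly totals for the two Holiday categories,
# with a deliberate 100x size difference between them.
temp = pd.DataFrame({
    "Holiday": ["No Holiday", "No Holiday", "Holiday", "Holiday"],
    "Hour": [8, 18, 8, 18],
    "Rented Bike Count": [3000, 5000, 30, 50],
})

# Divide each row by its group's total to get within-group percentages
group_total = temp.groupby("Holiday")["Rented Bike Count"].transform("sum")
temp["% of rented bike count"] = 100 * temp["Rented Bike Count"] / group_total

# Each category's percentages sum to 100, so the shapes are now comparable.
print(temp.groupby("Holiday")["% of rented bike count"].sum())
```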

In [24]:
temp = df.groupby(["Rain","Hour"])["Rented Bike Count"].mean().reset_index()
fig = px.line(temp, x="Hour", y="Rented Bike Count", color='Rain',title='Line plot of the mean of rented bikes per hour taking in count the Rain')
fig.show()

It doesn't rain much during the year, so we use the mean of rented bikes to interpret this column's trend.

We can clearly see that rain affects the number of rented bikes: when it rains, people don't rent bikes and probably use public transport instead. However, we can still distinguish the 8h and 18h peaks when it rains.

In [25]:
temp = df.groupby(["Seasons","Hour"])["Rented Bike Count"].mean().reset_index()
fig = px.line(temp, x="Hour", y="Rented Bike Count", color='Seasons',title='Line plot of the mean of rented bikes per hour taking in count seasons')
fig.show()

As before, we clearly see the 8h and 18h peaks every day. Every season seems to follow the same pattern except winter. At night, however, more bikes are rented in summer, possibly because of the higher night-time temperatures and the later sunset.

Plot of temperature
In [26]:
# temperature
temp = df.groupby(["Temperature(°C)","Seasons"])["Rented Bike Count"].sum().reset_index()
px.scatter(data_frame = temp
           ,x = 'Temperature(°C)'
           ,y = 'Rented Bike Count',color="Seasons",title='Scatter plot of the total number of bikes rented by temperature and season')

We can see that temperature is strongly correlated with the number of rented bikes: as it rises towards 20 °C, rentals generally increase. Beyond 30 °C the count drops drastically, as it is perhaps too hot for people to go biking.

In winter temperatures are low, so the rental count is low; the other seasons lead to the same conclusion as before.

Plot of visibility
In [27]:
temp = df.groupby("Visibility (10m)")["Rented Bike Count"].mean().reset_index()
fig = px.scatter(temp, x="Visibility (10m)", y="Rented Bike Count",
           hover_name="Rented Bike Count", size_max=60,trendline="ols",trendline_color_override="red", title='Scatter plot of the Rented bike mean by visibility, with ordinary least squares trend line')
fig.show()

This isn't really helpful for us; visibility doesn't seem to be an important variable.

Interactive Plots
  • Simple scatterplots to show the correlation of different columns with the Rented Bike Count.
  • Each point corresponds to the mean number of bikes rented for each unique value of the column.
In [28]:
textbox = widgets.Dropdown(
    description='Column:   ',
    value="Temperature(°C)",
    options=["Visibility (10m)","Hour","Temperature(°C)","Humidity(%)","Wind speed (m/s)","Solar Radiation (MJ/m2)"]
)

grouped = df.groupby("Temperature(°C)")["Rented Bike Count"].mean().reset_index()
# Assign a figure widget, initialised with the temperature scatter
trace = px.scatter(x=grouped["Temperature(°C)"], y=grouped["Rented Bike Count"],
                   title="Rented bike mean, by temperature",
                   labels={'x': "Temperature(°C)", 'y': 'mean of bikes'})

g = go.FigureWidget(data=trace,
                    layout=go.Layout(
                        title=dict(
                            text='scatter')))

def validate():
    if textbox.value in ["Visibility (10m)","Hour","Temperature(°C)","Humidity(%)","Wind speed (m/s)","Solar Radiation (MJ/m2)"]:
        return True
    else:
        return False


def response(change):
    if validate():                        
        grouped = df.groupby(textbox.value)["Rented Bike Count"].mean().reset_index()
        x1 = grouped[textbox.value]
        y1 = grouped["Rented Bike Count"]
        with g.batch_update():
            g.data[0].x=x1
            g.data[0].y=y1
            g.layout.title = "Rented bike mean, by "+str(textbox.value)
            g.layout.xaxis.title = textbox.value
            g.layout.yaxis.title = 'mean of Rented bike'

textbox.observe(response, names="value")
In [29]:
container2 = widgets.HBox([textbox])
widgets.VBox([container2,g])

These graphs can help us visualise the correlation for variables we haven't analysed yet.

  • Bar plot to compare the number of bikes rented by hour between two different months
In [30]:
@interact
def view_image(
    col=widgets.Dropdown(
        description="Month #1 :", value="July", options=df["Month"].unique()
    ),
    filtercol=widgets.Dropdown(
        description="Month #2 :", value="August", options=df["Month"].unique()
    ),
):
    temp1 = (df[df["Month"]==col].groupby(["Hour","Month"])["Rented Bike Count"].sum()).reset_index()
    temp2 = (df[df["Month"]==filtercol].groupby(["Hour","Month"])["Rented Bike Count"].sum()).reset_index()
    temp = pd.concat([temp1, temp2])  # DataFrame.append was removed in pandas 2.0
    newTitle= "Bar plot of the total number of bike rented by hour between "+col+" and "+filtercol
    fig = px.bar(temp, x="Hour", y="Rented Bike Count", color="Month",title=newTitle).update_layout(barmode="group")
    go.FigureWidget(fig.to_dict()).show()

This graph is really helpful when we want to compare two months side by side.

Machine Learning

The goal of this project is to predict an output given some parameters: in our case, how many bikes will likely be rented under given conditions (hour, season...).

This is why we use machine learning: we will try different models that each predict the output with a certain accuracy. In the end, we aim to find the model that best fits our dataset and gives the most accurate predictions. We will save it to a pickle file in order to use it in our Flask API.

Here are the different steps :

  • Preprocessing
  • Definition of results storing methods
  • Regression models:
    • Linear regression
    • Robust regression
    • Ridge regression
    • Random forest
    • SVM regression
  • Model selection
  • Model tuning
  • Model saving
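For the final "Model saving" step, here is a minimal sketch of the pickle round-trip the Flask API would rely on (the file name `model.pkl` and the stand-in random forest on synthetic data are assumptions; the notebook saves its own tuned model):

```python
import pickle
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in for the tuned model, fitted on synthetic data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=13)
model = RandomForestRegressor(n_estimators=10, random_state=13).fit(X_demo, y_demo)

# Serialise the fitted model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...then reload it, as the Flask API would at startup
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

# The reloaded model reproduces the original predictions exactly
print((loaded.predict(X_demo) == model.predict(X_demo)).all())  # True
```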

To present the results, each model's metrics are appended to a common results_df table (Model, MAE, MSE, RMSE, R2 Square, Cross Validation).

Libraries

In [31]:
import sklearn
from sklearn.model_selection import train_test_split

from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

# models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RANSACRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

Data set

In [32]:
X = df[['Hour','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)',
        'Solar Radiation (MJ/m2)','Seasons','Holiday','Month','Snow','Rain']].copy()  # .copy() avoids SettingWithCopyWarning below

y = df['Rented Bike Count']
In [33]:
X.head()
Out[33]:
Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Solar Radiation (MJ/m2) Seasons Holiday Month Snow Rain
0 0 -5.2 37 2.2 2000 0.0 Winter No Holiday December False No rain
1 1 -5.5 38 0.8 2000 0.0 Winter No Holiday December False No rain
2 2 -6.0 39 1.0 2000 0.0 Winter No Holiday December False No rain
3 3 -6.2 40 0.9 2000 0.0 Winter No Holiday December False No rain
4 4 -6.0 36 2.3 2000 0.0 Winter No Holiday December False No rain

In order to be able to train the models, we need to dummify the categorical predictors.

Encoding categorical predictors

In [34]:
X["Holiday"] = X["Holiday"].map( {'No Holiday': 0, 'Holiday': 1} ).astype(int)
X["Rain"] = X["Rain"].map( {'No rain': 0, 'light rain': 1,'Medium rain':2,'Strong rain':3} ).astype(int)
X["Month"] = X["Month"].map({"January":1, "February":2, "March":3, "April":4, "May":5,
                            "June":6, "July":7, "August":8, "September":9, "October":10,
                            "November":11, "December":12}).astype(int)
X["Snow"] = X["Snow"].map( {False: 0, True: 1} ).astype(int)

#We dummify Visibility as we concluded during analysis, same for solar radiation
X['Visibility (10m)']=X['Visibility (10m)'].apply(lambda x: 1 if x==2000 else 0)
X['Solar Radiation (MJ/m2)']=X['Solar Radiation (MJ/m2)'].apply(lambda x:1 if x>=0.5 else 0)

seasons = pd.get_dummies(X.Seasons)

X = pd.merge(X, seasons, left_index=True, right_index=True)
X.drop(columns="Seasons",inplace=True)

In [35]:
X.head()
Out[35]:
Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Solar Radiation (MJ/m2) Holiday Month Snow Rain Autumn Spring Summer Winter
0 0 -5.2 37 2.2 1 0 0 12 0 0 0 0 0 1
1 1 -5.5 38 0.8 1 0 0 12 0 0 0 0 0 1
2 2 -6.0 39 1.0 1 0 0 12 0 0 0 0 0 1
3 3 -6.2 40 0.9 1 0 0 12 0 0 0 0 0 1
4 4 -6.0 36 2.3 1 0 0 12 0 0 0 0 0 1

Splitting in test and train sets

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)

Results storing methods

In [37]:
def cross_val(model):
    # Note: for regressors, cross_val_score defaults to the R² score
    pred = cross_val_score(model, X, y, cv=10)
    return pred.mean()

def print_evaluate(true, predicted):  
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square', r2_square)
    print('__________________________________')
    
def evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square

Fitting multiple models

Linear regression

In [38]:
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train);
In [39]:
lin_test_pred = lin_reg.predict(X_test)
lin_train_pred = lin_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, lin_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, lin_train_pred)
Test set evaluation:
_____________________________________
MAE: 316.53596626611306
MSE: 177153.46686032347
RMSE: 420.8960285632587
R2 Square 0.5669228126461434
__________________________________
Train set evaluation:
_____________________________________
MAE: 323.19611493633107
MSE: 187748.52884557727
RMSE: 433.2995832511004
R2 Square 0.5465708248198713
__________________________________
In [40]:
results_df = pd.DataFrame(data=[["Linear Regression", *evaluate(y_test, lin_test_pred) , cross_val(LinearRegression())]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])

Robust Regression

In [41]:
rb_reg = RANSACRegressor(estimator=LinearRegression(), max_trials=100)  # "base_estimator" was renamed "estimator" in scikit-learn 1.1
rb_reg.fit(X_train, y_train)

rb_test_pred = rb_reg.predict(X_test)
rb_train_pred = rb_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rb_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rb_train_pred)
Test set evaluation:
_____________________________________
MAE: 391.833521115557
MSE: 342948.245051674
RMSE: 585.6178319106019
R2 Square 0.16161357715893288
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 388.3325467296809
MSE: 351211.5738438772
RMSE: 592.6310604785048
R2 Square 0.15179322458112843
__________________________________
In [42]:
results_df_2 = pd.DataFrame(data=[["Robust Regression", *evaluate(y_test, rb_test_pred) , cross_val(RANSACRegressor())]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

Ridge Regression

In [43]:
ridge_reg = Ridge(alpha=100, solver='cholesky', tol=0.0001, random_state=42)
ridge_reg.fit(X_train, y_train)

ridge_test_pred = ridge_reg.predict(X_test)
ridge_train_pred = ridge_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, ridge_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, ridge_train_pred)
Test set evaluation:
_____________________________________
MAE: 315.98443721955937
MSE: 177453.30652250646
RMSE: 421.2520700513013
R2 Square 0.5661898113684563
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 322.4911601049207
MSE: 188224.58145730325
RMSE: 433.8485697306184
R2 Square 0.545421115981115
__________________________________
In [44]:
results_df_2 = pd.DataFrame(data=[["Ridge Regression", *evaluate(y_test, ridge_test_pred) , cross_val(Ridge())]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)

Lasso

In [45]:
lasso_reg = Lasso(alpha=0.1, 
              precompute=True, 
#               warm_start=True, 
              positive=True, 
              selection='random',
              random_state=42)
lasso_reg.fit(X_train, y_train)

lasso_test_pred = lasso_reg.predict(X_test)
lasso_train_pred = lasso_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, lasso_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, lasso_train_pred)
Test set evaluation:
_____________________________________
MAE: 341.8390278337629
MSE: 205953.8702906379
RMSE: 453.8214079245688
R2 Square 0.496516074729515
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 346.266402557663
MSE: 214695.85910001665
RMSE: 463.3528451407379
R2 Square 0.4814906571844336
__________________________________
In [46]:
results_df_2 = pd.DataFrame(data=[["Lasso Regression", *evaluate(y_test, lasso_test_pred) , cross_val(Lasso())]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)

Random forest

In [47]:
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)

rf_test_pred = rf_reg.predict(X_test)
rf_train_pred = rf_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf_train_pred)
Test set evaluation:
_____________________________________
MAE: 139.8958779527559
MSE: 50304.18577838582
RMSE: 224.28594645760984
R2 Square 0.8770241662489956
__________________________________
Train set evaluation:
_____________________________________
MAE: 52.45140421940929
MSE: 7492.475801569621
RMSE: 86.55908849779797
R2 Square 0.9819050133513586
__________________________________
In [48]:
results_df_2 = pd.DataFrame(data=[["Random Forest Regressor", *evaluate(y_test, rf_test_pred), 0]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
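The cross-validation score is left at 0 for the forest above, since k-fold CV is costly for a 100-tree model. As a hedged alternative, sketched here on synthetic stand-in data (not the bike dataset, despite the `X_train`/`y_train` names), a random forest's out-of-bag (OOB) samples give a built-in generalization estimate at almost no extra cost:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data, NOT the bike dataset.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 5))
y_train = 3 * X_train[:, 0] + rng.normal(scale=0.1, size=500)

# oob_score=True scores each tree on the bootstrap samples it did not see,
# yielding an R2 estimate without a separate validation split.
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB R2: {rf_oob.oob_score_:.3f}")
```

On the real dataset, `rf_oob.oob_score_` could replace the 0 placeholder in the comparison table.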

Support Vector Machine

In [49]:
svm_reg = SVR(kernel='rbf', C=1000000, epsilon=0.001)
svm_reg.fit(X_train, y_train)

svm_test_pred = svm_reg.predict(X_test)
svm_train_pred = svm_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, svm_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, svm_train_pred)
Test set evaluation:
_____________________________________
MAE: 191.98828119563044
MSE: 98668.1227968321
RMSE: 314.11482422329595
R2 Square 0.7587915502888344
__________________________________
Train set evaluation:
_____________________________________
MAE: 187.7769387799055
MSE: 103427.46890952502
RMSE: 321.60141310250026
R2 Square 0.7502135851238172
__________________________________
In [50]:
results_df_2 = pd.DataFrame(data=[["SVM Regressor", *evaluate(y_test, svm_test_pred), 0]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)

Model selection

In [51]:
results_df[['Model',"RMSE","R2 Square"]].sort_values(by='R2 Square').style.background_gradient(cmap = 'coolwarm')
Out[51]:
  Model RMSE R2 Square
1 Robust Regression 585.617832 0.161614
3 Lasso Regression 453.821408 0.496516
2 Ridge Regression 421.252070 0.566190
0 Linear Regression 420.896029 0.566923
5 SVM Regressor 314.114824 0.758792
4 Random Forest Regressor 224.285946 0.877024
In [52]:
f, ax = plt.subplots(figsize=(15, 10),nrows=2,ncols=1)

clrs = ['#5499c7' for x in range(len(results_df)) ]
sns.barplot(x="Model", y="RMSE", data=results_df,palette=clrs,ax=ax[0])
mask2 = results_df["RMSE"]==results_df["RMSE"].min()
sns.barplot(x="Model", y=results_df["RMSE"][mask2],data=results_df, color='green',ax=ax[0])
ax[0].set_title('RMSE by model')

sns.barplot(x="Model", y="R2 Square", data=results_df,palette=clrs,ax=ax[1])
mask2 = results_df["R2 Square"]==results_df["R2 Square"].max()
sns.barplot(x="Model", y=results_df["R2 Square"][mask2],data=results_df, color='green',ax=ax[1])
ax[1].set_title('R2 Square by model');

A closer look at the Random Forest

Plot of Actual vs Predicted output

In [53]:
plt.figure(figsize=(10,6))

plt.scatter(rf_train_pred,rf_train_pred - y_train,
          c = 'black', marker = 'o', s = 35, alpha = 0.5,
          label = 'Train data')
plt.scatter(rf_test_pred,rf_test_pred - y_test,
          c = 'c', marker = 'o', s = 35, alpha = 0.7,
          label = 'Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc = 'upper left')
plt.hlines(y = 0, xmin = 0, xmax = 3500, lw = 2, color = 'red')
plt.title("Scatter plot of the error of predictions")
plt.show()

This plot shows for which range of values the model makes its most significant errors. On the train set the model is quite stable: the residuals follow the red line and are concentrated around 0, with a maximum error of about 500 bikes.

On the test set, however, the error spreads much more for large predictions, with a maximum error now close to 1500 bikes. The model remains fairly accurate for values predicted between 0 and 1500.

Let's check whether the observed errors are normally distributed, which would indicate that the model is sound and just needs tuning.

In [54]:
plt.figure(figsize=(10,6))
# histplot (axes-level) draws into the current figure; displot (figure-level)
# would create its own figure and ignore the title set below
sns.histplot(y_test - rf_test_pred)
plt.title("Distribution of the prediction error");

The errors follow a roughly normal distribution centred on 0, with no outliers. Our model is therefore not fundamentally flawed, and we can move on to improving it.
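To complement the histogram with a number, one could run a quick normality test on the residuals. This is a sketch using SciPy on synthetic stand-in residuals, not the notebook's actual `y_test - rf_test_pred`:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the residuals y_test - rf_test_pred.
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0, scale=150, size=1000)

# D'Agostino-Pearson test: a large p-value means normality is not rejected;
# strong skew or a tiny p-value would hint at systematic bias in the errors.
stat, p_value = stats.normaltest(residuals)
print(f"skew={stats.skew(residuals):.3f}, p-value={p_value:.3f}")
```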

Model tuning

Random gridsearch

In [275]:
n_estimators = [1000]
max_features = [1.0]  # use all features; the old 'auto' alias is removed in recent scikit-learn
max_depth =  [int(x) for x in np.linspace(10,100,10)]
min_samples_split = [2,5,15,30]
min_samples_leaf = [1,2,5,20]

random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf
}

# First create the base model to tune
rf = RandomForestRegressor()

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,scoring='neg_mean_squared_error', 
                               n_iter = 5, cv = 3, random_state=42, n_jobs = 1)

rf_random.fit(X_train,y_train);
In [276]:
rf_random_test_pred = rf_random.predict(X_test)
rf_random_train_pred = rf_random.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf_random_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf_random_train_pred)
Test set evaluation:
_____________________________________
MAE: 144.11846750879187
MSE: 53315.40472354186
RMSE: 230.9012878343078
R2 Square 0.8696628074543465
__________________________________
Train set evaluation:
_____________________________________
MAE: 100.78816187579896
MSE: 26772.530428548504
RMSE: 163.6231353707308
R2 Square 0.9353419892856454
__________________________________
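After the search, the winning configuration and its cross-validated score can be read back from the fitted object. A self-contained sketch (small synthetic data and a much smaller grid than the `random_grid` above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, NOT the bike dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={"max_depth": [5, 10, 20],
                         "min_samples_leaf": [1, 2, 5]},
    n_iter=3, cv=3, scoring="neg_mean_squared_error", random_state=42)
search.fit(X, y)

print(search.best_params_)            # winning hyper-parameters
print((-search.best_score_) ** 0.5)   # CV RMSE of the best candidate
```

The same calls (`rf_random.best_params_`, `rf_random.best_score_`) apply to the notebook's search above.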

Random Forest with 1000 trees

In [53]:
rf2_reg = RandomForestRegressor(n_estimators=1000)
rf2_reg.fit(X_train, y_train)

rf2_test_pred = rf2_reg.predict(X_test)
rf2_train_pred = rf2_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf2_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf2_train_pred)
Test set evaluation:
_____________________________________
MAE: 139.74534173228346
MSE: 50554.181303141726
RMSE: 224.8425700421113
R2 Square 0.8764130161505462
__________________________________
Train set evaluation:
_____________________________________
MAE: 52.24538345991561
MSE: 7320.688665734851
RMSE: 85.56102305217517
R2 Square 0.9823198943615431
__________________________________

Random Forest with 1500 trees

In [278]:
rf3_reg = RandomForestRegressor(n_estimators=1500)
rf3_reg.fit(X_train, y_train)

rf3_test_pred = rf3_reg.predict(X_test)
rf3_train_pred = rf3_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf3_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf3_train_pred)
Test set evaluation:
_____________________________________
MAE: 140.5512194225722
MSE: 51572.786940957834
RMSE: 227.09642652617376
R2 Square 0.8739228878314883
__________________________________
Train set evaluation:
_____________________________________
MAE: 51.95985552742616
MSE: 7160.954874487089
RMSE: 84.62242536400791
R2 Square 0.982705665486671
__________________________________

Selecting the final model

In [279]:
results_df = pd.DataFrame(data=[["Random Forest", *evaluate(y_test, rf_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])


results_df_2 = pd.DataFrame(data=[["Random Forest tuned randomly", *evaluate(y_test, rf_random_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)

results_df_2 = pd.DataFrame(data=[["Random Forest (ntree=1000)", *evaluate(y_test, rf2_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)

results_df_2 = pd.DataFrame(data=[["Random Forest (ntree=1500)", *evaluate(y_test, rf3_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
In [289]:
f, ax = plt.subplots(figsize=(15, 10),nrows=2,ncols=1)

clrs = ['#5499c7' for x in range(len(results_df)) ]
sns.barplot(x="Model", y="RMSE", data=results_df,palette=clrs,ax=ax[0])
mask2 = results_df["RMSE"]==results_df["RMSE"].min()
sns.barplot(x="Model", y=results_df["RMSE"][mask2],data=results_df, color='green',ax=ax[0])
ax[0].set_ylim(200,240)
ax[0].set_title('RMSE by model')

sns.barplot(x="Model", y="R2 Square", data=results_df,palette=clrs,ax=ax[1])
mask2 = results_df["R2 Square"]==results_df["R2 Square"].max()
sns.barplot(x="Model", y=results_df["R2 Square"][mask2],data=results_df, color='green',ax=ax[1])
ax[1].set_ylim(0.6,0.9)
ax[1].set_title('R2 Square by model');

Our final choice

In [54]:
final_model = rf3_reg # rf with ntree = 1500

After analyzing the data and removing the irrelevant columns, we tried different kinds of machine learning models to see which one suits our dataset best. The Random Forest model clearly provides the best results (R²: 0.87 / RMSE: ≈224) on the test set, which is why we decided to use this model in our Flask API.

After trying to improve this model, we concluded that the number of estimators was the only parameter worth tuning. Increasing it up to 1500 trees is enough, as beyond that level the improvement is no longer significant (the test R² barely moves).
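One cheap way to probe this plateau, sketched here on synthetic data (not the bike dataset), is to grow a single forest incrementally with `warm_start` instead of refitting from scratch at each size:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data, NOT the bike dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.2, size=400)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# warm_start=True keeps the already-fitted trees and only adds new ones
# when n_estimators is raised before the next fit() call.
rf = RandomForestRegressor(n_estimators=100, warm_start=True, random_state=0)
rmses = []
for n in (100, 500, 1000, 1500):
    rf.n_estimators = n
    rf.fit(X_tr, y_tr)
    rmses.append(mean_squared_error(y_te, rf.predict(X_te)) ** 0.5)
    print(n, round(rmses[-1], 4))
```

On data like this, the test RMSE typically flattens well before 1500 trees.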

Our model is accurate when predicting values under 1500; above that, its predictions diverge from reality. This is likely because the dataset contains too few rows with high target values (above 1500) to train the model correctly on large demand.

Looking at the MAE, our predictions are accurate to within about ±140 bikes.

Indeed, if we look at the dataset, 75% of the rows correspond to target values (number of bikes rented) under 1085, which is quite a narrow range.

We trained our model on a dataset excluding target values above 1500, and the MAE was halved without losing much R². This suggests that with a dataset containing more diverse target values, we would likely obtain better results.
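The filtering step described above can be sketched like this, on a tiny synthetic frame (only the column name "Rented Bike Count" comes from the dataset):

```python
import pandas as pd

# Tiny synthetic stand-in frame with the dataset's target column name.
toy = pd.DataFrame({"Rented Bike Count": [200, 900, 1600, 3000],
                    "Temperature(°C)": [-5.0, 10.0, 22.0, 25.0]})

# Keep only rows whose target is at most 1500 before splitting and retraining.
capped = toy[toy["Rented Bike Count"] <= 1500]
print(len(capped))  # 2
```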

In [16]:
f, ax = plt.subplots(figsize=(10, 5))
sns.histplot(data=df,x="Rented Bike Count");

Saving the model

In [55]:
import pickle
In [56]:
with open("model.pkl","wb") as f:
    pickle.dump(final_model,f)

An example of usage:

In [57]:
with open("model.pkl","rb") as f:
    clf2 = pickle.load(f)
In [58]:
print("Predicted : ",round(clf2.predict(X_test)[20])," vs  Actual : ",y_test.iloc[20])
Predicted :  2212  vs  Actual :  2267